Edge-Based Best-First Chart Parsing
Abstract
Best-first probabilistic chart parsing attempts to parse efficiently by working on edges that are judged "best" by some probabilistic figure of merit (FOM). Recent work has used probabilistic context-free grammars (PCFGs) to assign probabilities to constituents, and to use these probabilities as the starting point for the FOM. This paper extends this approach to using a probabilistic FOM to judge edges (incomplete constituents), thereby giving a much finer-grained control over parsing effort. We show how this can be accomplished in a particularly simple way using the common idea of binarizing the PCFG. The results obtained are about a factor of twenty improvement over the best prior results; that is, our parser achieves equivalent results using one twentieth the number of edges. Furthermore, we show that this improvement is obtained with parsing precision and recall levels superior to those achieved by exhaustive parsing.

* This material is based on work supported in part by NSF grants IRI-9319516 and SBR-9720368, and by ONR grant N0014-96-1-0549.

1 Introduction

Finding one (or all) parses for a sentence according to a context-free grammar requires search. Fortunately, there are well-known O(n^3) algorithms for parsing, where n is the length of the sentence. Unfortunately, for large grammars (such as the PCFG induced from the Penn II WSJ corpus, which contains around 1.6 × 10^4 rules) and longish sentences (say, 40 words and punctuation), even O(n^3) looks pretty bleak.

One well-known O(n^3) parsing method (Kay, 1980) is chart parsing. In this approach one maintains an agenda of items remaining to be processed, one of which is processed during each iteration. As each item is pulled off the agenda, it is added to the chart (unless it is already there, in which case it can be discarded) and used to extend and create additional items.
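The agenda loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's parser: the toy grammar, the lexicon, and the name `chart_parse` are hypothetical, the agenda is a plain stack, and items are complete constituents `(label, start, end)` only, whereas the paper also places edges (incomplete constituents) on the agenda.

```python
from collections import defaultdict

# Hypothetical toy grammar in binarized form: (left, right) -> parent labels.
BINARY = {("NP", "VP"): ["S"], ("Det", "N"): ["NP"], ("V", "NP"): ["VP"]}
# Hypothetical lexicon: word -> part-of-speech tags.
LEXICON = {"the": ["Det"], "dog": ["N"], "cat": ["N"], "saw": ["V"]}

def chart_parse(words):
    """Agenda-driven bottom-up chart parsing over items (label, start, end)."""
    # Seed the agenda with one item per (word, tag) pair.
    agenda = [(tag, i, i + 1)
              for i, w in enumerate(words)
              for tag in LEXICON.get(w, [])]
    chart = set()
    by_start = defaultdict(set)  # start position -> chart items
    by_end = defaultdict(set)    # end position -> chart items
    while agenda:
        item = agenda.pop()      # process one item per iteration
        if item in chart:
            continue             # already in the chart: discard
        chart.add(item)
        label, start, end = item
        by_start[start].add(item)
        by_end[end].add(item)
        # Use the item to create new items with neighbours on its right...
        for rlabel, _, rend in by_start[end]:
            for parent in BINARY.get((label, rlabel), []):
                agenda.append((parent, start, rend))
        # ...and with neighbours on its left.
        for llabel, lstart, _ in by_end[start]:
            for parent in BINARY.get((llabel, label), []):
                agenda.append((parent, lstart, end))
    return chart

chart = chart_parse("the dog saw the cat".split())
print(("S", 0, 5) in chart)
```

The loop only ever looks at the chart through the two position indexes, so the order in which items leave the agenda affects how much work is done before a parse appears, not which items can eventually be built.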
In "exhaustive" chart parsing one removes items from the agenda in some relatively simple way (last-in, first-out is common), and continues to do so until nothing remains. A commonly discussed alternative is to remove the constituents from the agenda according to a figure of merit (FOM). The idea is that the FOM selects "good" items to be processed, leaving the "bad" ones (the ones that are not, in fact, part of the correct parse) sitting on the agenda. When one has a completed parse, or perhaps several possible parses, one simply stops parsing, leaving items remaining on the agenda. The time that would have been spent processing these remaining items is time saved, and thus time earned.

In our work we have found that exhaustively parsing maximum-40-word sentences from the Penn II treebank requires an average of about 1.2 million edges per sentence. Numbers like this suggest that any approach that offers the possibility of reducing the work load is well worth pursuing, a fact that has been noted by several researchers. Early on, Kay (1980) suggested the use of the chart agenda for this purpose. More recently, the statistical approach to language processing and the use of probabilistic context-free grammars (PCFGs) has suggested using the PCFG probabilities to create a FOM. Bobrow (1990) and Chitrao and Grishman (1990) introduced best-first PCFG parsing, the approach taken here. Subsequent work has suggested different FOMs built from PCFG probabilities (Miller and Fox, 1994; Kochman and Kupin, 1991; Magerman and
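As a rough sketch of the best-first idea, the agenda can be a priority queue ordered by a figure of merit, with parsing stopped as soon as a complete parse is popped. Everything named here is an assumption made for illustration: the toy PCFG, the function `best_first_parse`, and especially the FOM, which is simply the probability of the constituent's best subtree found so far. The FOMs studied in the paper and in the work it cites are more refined, and the paper ranks edges rather than only complete constituents.

```python
import heapq
from collections import defaultdict

# Hypothetical binarized toy PCFG: (left, right) -> [(parent, rule probability)].
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.7)],
    ("NP", "PP"): [("NP", 0.3)],
    ("V", "NP"): [("VP", 0.6)],
    ("VP", "PP"): [("VP", 0.4)],
    ("P", "NP"): [("PP", 1.0)],
}
# Hypothetical lexicon: word -> [(tag, probability)].
LEXICON = {"the": [("Det", 1.0)], "dog": [("N", 0.5)], "park": [("N", 0.5)],
           "saw": [("V", 1.0)], "in": [("P", 1.0)]}

def best_first_parse(words, goal="S"):
    """Pop the highest-FOM item first; stop at the first goal item spanning the input.

    Returns (probability, items_processed), or None if no parse is found.
    """
    # heapq is a min-heap, so store the negated FOM to pop the largest first.
    heap = [(-p, tag, i, i + 1)
            for i, w in enumerate(words)
            for tag, p in LEXICON.get(w, [])]
    heapq.heapify(heap)
    chart = {}                    # (label, start, end) -> probability
    by_start = defaultdict(list)  # start position -> chart keys
    by_end = defaultdict(list)    # end position -> chart keys
    processed = 0
    while heap:
        neg_p, label, start, end = heapq.heappop(heap)
        key = (label, start, end)
        if key in chart:
            continue              # already in the chart: discard
        prob = -neg_p
        chart[key] = prob
        processed += 1
        if label == goal and start == 0 and end == len(words):
            return prob, processed  # stop; remaining agenda items are never processed
        by_start[start].append(key)
        by_end[end].append(key)
        for rkey in by_start[end]:    # neighbours on the right
            for parent, rule_p in BINARY.get((label, rkey[0]), []):
                fom = prob * rule_p * chart[rkey]
                heapq.heappush(heap, (-fom, parent, start, rkey[2]))
        for lkey in by_end[start]:    # neighbours on the left
            for parent, rule_p in BINARY.get((lkey[0], label), []):
                fom = chart[lkey] * rule_p * prob
                heapq.heappush(heap, (-fom, parent, lkey[1], end))
    return None

print(best_first_parse("the dog saw the dog in the park".split()))
```

The early `return` is where the saving comes from: whatever is still on the heap at that point is work that an exhaustive parser would have done and this loop skips.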
Similar resources
Edge-Based Best-First Chart
Best-first probabilistic chart parsing attempts to parse efficiently by working on edges that are judged "best" by some probabilistic figure of merit (FOM). Recent work has used probabilistic context-free grammars (PCFGs) to assign probabilities to constituents, and to use these probabilities as the starting point for the FOM. This paper extends this approach to using a probabilistic FOM to judge edg...
A Tabulation-Based Parsing Method that Reduces Copying
This paper presents a new bottom-up chart parsing algorithm for Prolog along with a compilation procedure that reduces the amount of copying at run-time to a constant number (2) per edge. It has applications to unification-based grammars with very large partially ordered categories, in which copying is expensive, and can facilitate the use of more sophisticated indexing strategies for retrievin...
A Parser for Portable NL Interfaces Using Graph-Unification-Based Grammars
This paper presents the reasoning behind the selection and design of a parser for the Lingo project on natural language interfaces at MCC. The major factors in the selection of the parsing algorithm were the choices of having a syntactically based grammar, using a graph-unification-based representation language, using Combinatory Categorial Grammars, and adopting a one-to-many mapping from synt...
An Effective Framework for Chinese Syntactic Parsing
This paper presents an effective framework for Chinese syntactic parsing, which includes two parts. The first one is a parsing framework, which is based on an improved bottom-up chart parsing algorithm, and integrates the idea of the beam search strategy of N best algorithm and heuristic function of A* algorithm for pruning, then get multiple parsing trees. The second is a novel evaluation mode...
Figures of Merit for Best-First Probabilistic Chart Parsing
Best-first parsing methods for natural language try to parse efficiently by considering the most likely constituents first. Some figure of merit is needed by which to compare the likelihood of constituents, and the choice of this figure has a substantial impact on the efficiency of the parser. While several parsers described in the literature have used such techniques, there is no published dat...